## [1] 1599 13
## 'data.frame': 1599 obs. of 13 variables:
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity: num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ Free_SO2 : num 11 25 15 17 11 13 15 15 9 17 ...
## $ Total_SO2 : num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## $ quality.cut : Factor w/ 3 levels "(0,4]","(4,6]",..: 2 2 2 2 2 2 2 3 3 2 ...
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 4.60 Min. :0.1200 Min. :0.000 Min. : 0.900
## 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090 1st Qu.: 1.900
## Median : 7.90 Median :0.5200 Median :0.260 Median : 2.200
## Mean : 8.32 Mean :0.5278 Mean :0.271 Mean : 2.539
## 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420 3rd Qu.: 2.600
## Max. :15.90 Max. :1.5800 Max. :1.000 Max. :15.500
## chlorides Free_SO2 Total_SO2 density
## Min. :0.01200 Min. : 1.00 Min. : 6.00 Min. :0.9901
## 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00 1st Qu.:0.9956
## Median :0.07900 Median :14.00 Median : 38.00 Median :0.9968
## Mean :0.08747 Mean :15.87 Mean : 46.47 Mean :0.9967
## 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00 3rd Qu.:0.9978
## Max. :0.61100 Max. :72.00 Max. :289.00 Max. :1.0037
## pH sulphates alcohol quality
## Min. :2.740 Min. :0.3300 Min. : 8.40 Min. :3.000
## 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50 1st Qu.:5.000
## Median :3.310 Median :0.6200 Median :10.20 Median :6.000
## Mean :3.311 Mean :0.6581 Mean :10.42 Mean :5.636
## 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :4.010 Max. :2.0000 Max. :14.90 Max. :8.000
## quality.cut
## (0,4] : 63
## (4,6] :1319
## (6,10]: 217
##
##
##
Let's first plot a histogram of fixed acidity.
Fixed acidity appears to be approximately normally distributed, with most samples falling between 6.5 g/dm3 and 9.2 g/dm3.
Volatile acidity appears to follow a bimodal distribution, with most samples falling between 0.25 g/dm3 and 0.79 g/dm3; after a log transform, the distribution looks approximately normal.
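The effect of a log transform on a histogram can be sketched as follows. This is a minimal illustration on synthetic, right-skewed values (the vector `volatile_acidity` and the helper `skewness` are assumptions made for this example, not the wine data or the original code), and it shows only how the transform pulls a long right tail towards symmetry, not the bimodality seen above.

```r
# Synthetic, right-skewed stand-in for volatile acidity (NOT the wine data).
set.seed(42)
volatile_acidity <- rlnorm(1000, meanlog = log(0.5), sdlog = 0.35)

# Simple moment-based skewness; values near 0 indicate a symmetric distribution.
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

skew_raw <- skewness(volatile_acidity)
skew_log <- skewness(log10(volatile_acidity))

# Histograms on the raw and log10 scales, side by side.
op <- par(mfrow = c(1, 2))
hist(volatile_acidity, breaks = 30, main = "Raw",
     xlab = "volatile acidity (g/dm3)")
hist(log10(volatile_acidity), breaks = 30, main = "log10",
     xlab = "log10(volatile acidity)")
par(op)

# Skewness before and after the transform.
c(raw = skew_raw, log10 = skew_log)
```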
From the above plots, the following observations are made:
Quality ranges from 3 to 8. Most wines are of medium quality (5 to 6), and only a small percentage are of good quality.
Also, from the above plot we can see that most of the wines fall in the (4,6] quality range.
There are 1599 red wines in this data set with 12 features (fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, alcohol, quality). The two sulfur dioxide features appear as Free_SO2 and Total_SO2 in the output above.
The following observations are made from the data set:
The main features in the data set are alcohol, quality and quality.cut. I’d like to determine which features are best for predicting the quality of wine. I think that, along with alcohol, the quantity of SO2 (free and total) and acidity (both fixed and volatile) might be used in a predictive model of wine quality.
SO2 (free and total), acidity (both fixed and volatile), and density are likely to contribute to wine quality.
Yes, quality.cut is a variable added to the data set that divides the samples into 3 quality bins: (0,4], (4,6] and (6,10].
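One way such a binned variable could be produced is with base R's `cut()`. The sketch below is an assumption about the approach: the break points are inferred from the three level labels in the output, and the original call is not shown in this report.

```r
# One example of each score observed in the data set (3 to 8).
quality <- c(3L, 4L, 5L, 6L, 7L, 8L)

# Right-closed intervals (the default) with breaks at 0, 4, 6 and 10.
quality_cut <- cut(quality, breaks = c(0, 4, 6, 10))

levels(quality_cut)   # the three bin labels
table(quality_cut)    # number of scores falling in each bin
```

Because the intervals are closed on the right, a score of 4 falls in (0,4] and a score of 6 falls in (4,6].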
According to all the above plots, there are outliers in some of the features, such as SO2 (free and total) and acidity (fixed and volatile). Also, the distribution of volatile acidity appears to be bimodal, but after a log transform it looks approximately normal.
Let's run a scatterplot matrix and look at the correlations between the features.
The scatterplot matrix shows the following behaviour:
Also, from the above scatterplot matrix, chlorides and sulphates do not seem to have any effect on quality.
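A scatterplot matrix and the matching correlation coefficients can be sketched with base R's `pairs()` and `cor()`. The data frame `toy` below is synthetic, and the positive effect of alcohol and negative effect of volatile acidity are built into it by construction purely for illustration; it is not the wine data and not necessarily the plotting function used for the figure above.

```r
# Synthetic stand-in data (NOT the wine data); effects are built in on purpose.
set.seed(1)
n <- 200
alcohol <- runif(n, 8.4, 14.9)
volatile.acidity <- runif(n, 0.12, 1.58)
quality <- 0.4 * alcohol - 1.5 * volatile.acidity + rnorm(n, sd = 0.5)
toy <- data.frame(alcohol, volatile.acidity, quality)

# Pairwise Pearson correlations of every feature with quality.
cors <- cor(toy)[, "quality"]
round(cors, 2)

# Scatterplot matrix of all pairs of columns.
pairs(toy)
```

Since quality is an ordinal score in the real data, `cor(toy, method = "spearman")` may be a more suitable choice there.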
The positive correlation between alcohol and quality is summarized below:
There seems to be no significant bias in alcohol content, even though samples with higher alcohol content show a higher density reading for quality levels 3 and 5.
The negative correlation between volatile acidity and quality is summarized below:
Wines with higher volatile acidity seem to show a higher density for quality levels 5, 7 and 8.
## quality Mean_Volatile_Acidity Variance_Volatile_Acidity
## 1 3 0.8845000 0.10973028
## 2 4 0.6939623 0.04844842
## 3 5 0.5770411 0.02715943
## 4 6 0.4974843 0.02590885
## 5 7 0.4039196 0.02109011
## 6 8 0.4233333 0.02100000
## Standard_Deviation_Volatile_Acidity
## 1 0.3312556
## 2 0.2201100
## 3 0.1648012
## 4 0.1609623
## 5 0.1452244
## 6 0.1449138
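Even though quality levels 5, 7 and 8 show a higher density for volatile acidity, their means are lower than those of quality levels 3 and 4. We also observe that, as quality increases, the mean, variance and standard deviation generally decrease; the one exception is the mean, which rises slightly from 0.404 at quality 7 to 0.423 at quality 8.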
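A per-quality summary table of this shape can be computed in base R by splitting the values by group. The sketch below uses a small hand-made data frame (`toy`, hypothetical values chosen for easy arithmetic) rather than the wine data, and the column names simply mirror the output above; the original summarising code is not shown in this report.

```r
# Tiny hand-made example (NOT the wine data).
toy <- data.frame(
  quality = c(3, 3, 3, 5, 5, 5, 7, 7, 7),
  volatile.acidity = c(0.60, 0.90, 1.20, 0.40, 0.55, 0.70, 0.30, 0.40, 0.50)
)

# Split volatile acidity by quality level, then summarise each group.
by_quality <- split(toy$volatile.acidity, toy$quality)
stats <- data.frame(
  quality = as.numeric(names(by_quality)),
  Mean_Volatile_Acidity = unname(vapply(by_quality, mean, numeric(1))),
  Variance_Volatile_Acidity = unname(vapply(by_quality, var, numeric(1))),
  Standard_Deviation_Volatile_Acidity = unname(vapply(by_quality, sd, numeric(1)))
)
stats
```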
The positive correlation between Free SO2 and Total SO2 is summarized below:
Most of the points seem to be clustered around 0-20 mg/dm3 Free SO2 and 0-50 mg/dm3 Total SO2.
Let us find how residual sugar and quality are related.
Except for quality 3, the other quality ratings show a higher density of residual sugar. However, no pattern is observed that would help us predict wine quality from residual sugar, so it is not a good attribute for classifying wine quality.
Let us see the relation between fixed acidity and citric acid.
Since citric acid is one of the components of fixed acidity, the two exhibit a significant positive correlation.
A positive correlation is observed. Also, in the range 6 - 12.5 we observe very little deviation from the mean, while outside this range the deviation from the mean is significant.
Most data is clustered within the 0.995 g/cm3 to 1.000 g/cm3 range for density and the 6.5 g/dm3 to 9 g/dm3 range for fixed acidity.
A significant negative correlation is observed (since pH decreases as acidity increases), and most of the data is clustered around the range 5 - 14 mg / dm3.
Data is clustered in the middle, with some points scattered around the plot, and exhibits a negative correlation.
Most of the data is clustered for alcohol level less than 11% by volume and 0.5 g/dm3.
Strong negative correlation is observed which might be because volatile acidity reduces the quality of wine (negative correlation) while citric acid increases wine quality (positive correlation).
Let us plot some box plots against quality cut to observe the outliers.
Most of the outliers seem to lie in the quality range (4,6].
Most of the outliers seem to lie in the quality range (4,6], which is something we observed for pH too.
In this plot, outliers exist only in quality cut (4,6].
This box plot seems to have the fewest outliers compared to the other plots, and since quality and citric acid are positively correlated, this feature might be used for predicting quality.
Both of the plots above seem to contain outliers in every quality cut range.
From all the box plots we have seen, it seems that quality cut (4,6] exhibits most of the outliers across the other features, which is not good for prediction models. This may be because most of the data lies in this range, as observed from the scatterplot matrix and the bar plot of quality cut.
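The idea that the largest bin shows the most outliers simply because it holds the most data can be checked by counting box plot outliers per bin. The sketch below uses synthetic values drawn from one common distribution (`toy` is a hypothetical data frame with group sizes only roughly similar to the real bins) and base R's `boxplot.stats()`, which applies the same 1.5 x IQR whisker rule as `boxplot()`.

```r
# Synthetic pH-like values from ONE common distribution (NOT the wine data).
set.seed(7)
bins <- c("(0,4]", "(4,6]", "(6,10]")
sizes <- c(60, 1300, 220)
toy <- data.frame(
  quality.cut = factor(rep(bins, times = sizes), levels = bins),
  pH = rnorm(sum(sizes), mean = 3.3, sd = 0.15)
)

# Number of points beyond the box plot whiskers in each bin.
outliers <- tapply(toy$pH, toy$quality.cut,
                   function(x) length(boxplot.stats(x)$out))

# Outlier counts alongside bin sizes and the per-bin outlier rate.
data.frame(bin = bins, n = sizes,
           outliers = as.integer(outliers),
           rate = as.numeric(outliers) / sizes)
```

Comparing the outlier rate rather than the raw count gives a fairer picture of whether one bin is genuinely noisier than the others.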
Chlorides and sulphates do not exhibit any significant relationship with the other features. Volatile acidity, however, seems to exhibit a weak positive correlation with pH, which is unexpected. Also, most of the outliers are in the quality range (4,6], which is not good for prediction models.
Strong relationships that I observed are:
## quality.cut Mean_Alcohol Median_Alcohol
## 1 (0,4] 10.21587 10.0
## 2 (4,6] 10.25272 10.0
## 3 (6,10] 11.51805 11.6
Something interesting emerges from this chart: good wines concentrate where citric acid is above 0.3 and alcohol is above 10.5. That is, when both are at certain levels, quality scores tend to be higher.
## quality.cut Mean_Volatile_Acidity Median_Volatile_Acidity
## 1 (0,4] 0.7242063 0.68
## 2 (4,6] 0.5385595 0.54
## 3 (6,10] 0.4055300 0.37
The graph shows that good wines have citric acid above 0.27 g/dm3 and volatile acidity below 0.5 g/dm3.
## quality.cut Mean_Fixed_Acidity Median_Fixed_Acidity
## 1 (0,4] 7.871429 7.5
## 2 (4,6] 8.254284 7.8
## 3 (6,10] 8.847005 8.7
Wines in the quality range (4,6] lie within the 0.995 g/cm3 to 1.000 g/cm3 range for density and the 6.5 g/dm3 to 9 g/dm3 range for fixed acidity. Good wines seem to be distributed across the plot.
Good wines concentrate where citric acid is above 0.3 and alcohol is above 10.5, and also where citric acid is above 0.27 g/dm3 and volatile acidity is below 0.5 g/dm3.
The following things were also observed:
An interesting interaction is observed: good wines concentrate where citric acid is above 0.3 and alcohol is above 10.5. That is, when both are at certain levels, quality scores tend to be higher.
In both figures, at the 95% confidence level, the prediction range for the quality range (0,4] is wider than for the remaining quality ranges, which might be because there is less data in that range to train the classifier.
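The link between group size and interval width can be illustrated with the half-width of a 95% confidence interval for a group mean, t * s / sqrt(n). The sketch below takes the three group sizes from the quality.cut summary and assumes, purely for illustration, a common standard deviation of 1 in every group; it is not the interval calculation behind the figures.

```r
# Group sizes from the quality.cut summary output.
n <- c("(0,4]" = 63, "(4,6]" = 1319, "(6,10]" = 217)

# Assumed common standard deviation (hypothetical, for illustration only).
s <- 1

# Half-width of a two-sided 95% t-interval for each group mean.
half_width <- qt(0.975, df = n - 1) * s / sqrt(n)
round(half_width, 3)
```

With equal spread, the smallest group, (0,4], gets the widest interval and the largest group, (4,6], the narrowest.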
Citric acid adds freshness to the wine, which we can see in the plot: the median of citric acid increases as quality increases.
In each step we can see the negative influence of volatile acidity on a wine’s quality score.
The data set contains information on 1599 wine samples across 12 variables. In the initial phase, I worked on understanding individual variables (univariate analysis), from which I explored interesting questions and made observations. Then I explored wine quality across multiple variables (bivariate and multivariate analysis).
The analysis performed on the sample data set can be summarised as below:
pH is considered an important parameter when determining the quality of the wine. The optimum pH was found to be between 3.0 and 3.5. Wines with a pH higher than 3.5 tend to exhibit higher SO2 values, which can be a concern for people with health issues related to SO2. Samples with higher alcohol content exhibited lower SO2 counts.
Some of the learnings from the analysis were as follows:
A limitation of the current analysis is that the data consists of samples collected from a specific region (as per the data documentation, the wine samples are from Portugal) and the data is old (from 2009). It would be great to get a larger and more recent data set from different regions of the world. I would also like to construct linear models for predicting wine quality and to calculate the accuracy of the models.
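As a starting point for that future work, a linear model with a held-out accuracy check could look like the sketch below. Everything here is an assumption for illustration: `toy` is synthetic data with the alcohol and volatile acidity effects built in, the 200/100 split is arbitrary, and root mean squared error is just one possible accuracy measure.

```r
# Synthetic stand-in data (NOT the wine data); effects are built in on purpose.
set.seed(123)
n <- 300
toy <- data.frame(
  alcohol = runif(n, 8.4, 14.9),
  volatile.acidity = runif(n, 0.12, 1.58)
)
toy$quality <- 2 + 0.3 * toy$alcohol - 1.2 * toy$volatile.acidity +
  rnorm(n, sd = 0.6)

# Simple train/test split.
train <- toy[1:200, ]
test <- toy[201:300, ]

# Fit on the training rows, predict on the held-out rows.
fit <- lm(quality ~ alcohol + volatile.acidity, data = train)
pred <- predict(fit, newdata = test)

# Root mean squared error on the held-out rows.
rmse <- sqrt(mean((test$quality - pred)^2))
coef(fit)
rmse
```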